Building a Simple Linear Model to Better Understand Regression Methods
Get in Loser, We’re Fitting Lines to Data
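The `df` object is never defined in the slides. A minimal simulated stand-in (entirely hypothetical — the slope of roughly 0.29 points per hour and intercept of roughly 61.25 are inferred from the fitted values in the table further down) might look like:

```r
# Hypothetical stand-in for df: exam scores rising ~0.29 points per
# hour studied (inferred from the fitted values shown later), plus noise.
set.seed(2024)
n  <- 250
df <- data.frame(hours_studied = round(runif(n, min = 5, max = 40)))
df$exam_score <- round(61.25 + 0.29 * df$hours_studied + rnorm(n, sd = 2.5))
```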
```r
model <- lm(exam_score ~ hours_studied, data = df)

df <- df |>
  mutate(
    fitted   = model$fitted.values,
    residual = model$residuals
  )
```
```r
df |>
  select(hours_studied, exam_score, fitted, residual) |>
  janitor::clean_names(case = "title") |>
  slice_sample(n = 10) |>
  gt() |>
  fmt_number(columns = c(Fitted, Residual), decimals = 2) |>
  cols_align(align = "center", columns = everything())
```

| Hours Studied | Exam Score | Fitted | Residual |
|---|---|---|---|
| 24 | 67 | 68.25 | −1.25 |
| 26 | 69 | 68.83 | 0.17 |
| 35 | 69 | 71.46 | −2.46 |
| 22 | 71 | 67.66 | 3.34 |
| 24 | 70 | 68.25 | 1.75 |
| 26 | 72 | 68.83 | 3.17 |
| 13 | 63 | 65.03 | −2.03 |
| 24 | 68 | 68.25 | −0.25 |
| 19 | 64 | 66.79 | −2.79 |
| 32 | 66 | 70.58 | −4.58 |
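Two properties of least-squares residuals worth sanity-checking: they sum to (numerically) zero, and they are uncorrelated with the predictor. A standalone sketch with hypothetical toy values, using the same column names as the slides:

```r
# Toy data (hypothetical values) standing in for the slides' df.
toy <- data.frame(
  hours_studied = c(10, 15, 20, 25, 30, 35),
  exam_score    = c(63, 66, 68, 69, 71, 72)
)
toy_fit <- lm(exam_score ~ hours_studied, data = toy)

sum(residuals(toy_fit))                    # numerically zero
cov(residuals(toy_fit), toy$hours_studied) # numerically zero
```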
```r
df |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted,
      xend = hours_studied, yend = fitted + residual
    ),
    linewidth = 1, colour = "#ED8B00"
  ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")
```

```r
df |>
  slice_sample(n = 10) |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted,
      xend = hours_studied, yend = fitted + residual
    ),
    linewidth = 1, colour = "#ED8B00"
  ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")
```

```r
df |>
  slice_sample(n = 100) |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted,
      xend = hours_studied, yend = fitted + residual
    ),
    linewidth = 1, colour = "#ED8B00"
  ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")
```

```r
df |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted,
      xend = hours_studied, yend = fitted + residual
    ),
    linewidth = 1, colour = "#ED8B00"
  ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")
```

\[\text{Residual Sum of Squares (RSS)} = \sum_{i=1}^n \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2\]
Minimising RSS gives us the line that best fits the data, but we don’t know what \(\beta_0\) or \(\beta_1\) are!
Minimise RSS (Solve for \(\beta_0\), then \(\beta_1\))
Our good friends \(\beta_1\), \(\beta_0\), and \(\epsilon\).
\[Y = \beta_0 + \beta_1 X + \epsilon\]
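The slope and intercept formulas that follow come from calculus: differentiate the RSS with respect to each coefficient and set both derivatives to zero.

\[\frac{\partial \text{RSS}}{\partial \beta_0} = -2 \sum_{i=1}^n \left( y_i - \beta_0 - \beta_1 x_i \right) = 0 \]

\[\frac{\partial \text{RSS}}{\partial \beta_1} = -2 \sum_{i=1}^n x_i \left( y_i - \beta_0 - \beta_1 x_i \right) = 0 \]

The first equation rearranges to \(\beta_0 = \bar{y} - \beta_1 \bar{x}\); substituting that into the second and solving gives the slope formula.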
\[\beta_1 = \frac{\text{Cov}(X, Y)}{\text{Var}(X)} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]
\[\beta_0 = \bar{y} - \beta_1 \bar{x} \]
\[\hat{y}_i = \beta_0 + \beta_1 x_i \]
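The closed-form estimates above can be computed directly, no `lm()` required. A quick check with hypothetical toy values:

```r
# The closed-form least-squares estimates, computed "from scratch":
# slope = Cov(X, Y) / Var(X), intercept = mean(y) - slope * mean(x).
# Toy data (hypothetical values) standing in for the slides' df.
x <- c(10, 15, 20, 25, 30)
y <- c(64, 66, 67, 70, 71)

beta_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta_0 <- mean(y) - beta_1 * mean(x)

c(beta_0, beta_1)   # 60.40 0.36
coef(lm(y ~ x))     # identical, up to floating-point error
```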
What Happens When We Add More Predictors?
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon\]
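With several predictors, the same least-squares idea generalises to matrix form, \(\hat{\beta} = (X^\top X)^{-1} X^\top y\). A sketch with simulated data (all values hypothetical):

```r
# Multiple regression "from scratch" via the normal equations.
set.seed(1)
n  <- 100
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y  <- 5 + 2 * x1 - 1.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # solve the normal equations

as.vector(beta_hat)                        # matches coef(lm(y ~ x1 + x2))
```

In practice `lm()` solves this via a QR decomposition rather than forming \(X^\top X\), which is more numerically stable, but the normal equations make the logic explicit.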
Where Next, Magic Math Man?
Contact:
Code & Slides:
Paul Johnson // Linear Regression from Scratch // Nov 28, 2024